Analyzing NYC High School Data

Posted on February 09, 2017 in posts

Analyzing The NYC High School SAT Data

Here I will explore the relationship between SAT scores and demographic factors in New York City public schools. For a brief bit of background, the SAT, or Scholastic Aptitude Test, is a test given to graduating high schoolers in the US every year. The SAT has 3 sections, each of which is worth a maximum of 800 points. The SAT is used by colleges to determine which students to admit. High average SAT scores are usually indicative of a good school.
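Since three 800-point sections make up the composite score, the maximum possible score is 2400. A minimal sketch of how the composite used throughout this analysis is formed (the function name is illustrative, not from the dataset):

```python
# Each of the three SAT sections (Critical Reading, Math, Writing)
# is scored out of a maximum of 800 points.
SECTION_MAX = 800

def composite_score(reading, math, writing):
    """Sum the three section scores into a composite SAT score."""
    return reading + math + writing

print(composite_score(SECTION_MAX, SECTION_MAX, SECTION_MAX))  # maximum: 2400
```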

New York City has published data on students' SAT scores, along with additional demographic datasets. In earlier preprocessing steps, we combined the following datasets into a single, clean pandas DataFrame:

- SAT scores by school -- SAT scores for each high school in New York City.
- School attendance -- attendance information on every school in NYC.
- Class size -- class size information for each school in NYC.
- AP test results -- Advanced Placement exam results for each high school. Passing AP exams can earn you college credit in the US.
- Graduation outcomes -- percentage of students who graduated, and other outcome information.
- Demographics -- demographic information for each school.
- School survey -- surveys of parents, teachers, and students at each school.

New York City has a significant immigrant population and is very diverse, so comparing demographic factors such as race, income, and gender with SAT scores is a good way to figure out whether the SAT is a fair test. If certain racial groups consistently performed better on the SAT, for example, we would have some evidence that the SAT is unfair.

The DataFrame combined contains all of our data, and is what we'll be using in our analysis.

Read in the data

In [1]:
import matplotlib
matplotlib.use('qt4agg')
import pandas as pd
import numpy as np
import re

data_files = [
    "ap_2010.csv",
    "class_size.csv",
    "demographics.csv",
    "graduation.csv",
    "hs_directory.csv",
    "sat_results.csv"
]

data = {}

for f in data_files:
    d = pd.read_csv("schools/{0}".format(f))
    data[f.replace(".csv", "")] = d

Read in the surveys

In [2]:
all_survey = pd.read_csv("schools/survey_all.txt", delimiter="\t", encoding='windows-1252')
d75_survey = pd.read_csv("schools/survey_d75.txt", delimiter="\t", encoding='windows-1252')
survey = pd.concat([all_survey, d75_survey], axis=0)

survey["DBN"] = survey["dbn"]

survey_fields = [
    "DBN", 
    "rr_s", 
    "rr_t", 
    "rr_p", 
    "N_s", 
    "N_t", 
    "N_p", 
    "saf_p_11", 
    "com_p_11", 
    "eng_p_11", 
    "aca_p_11", 
    "saf_t_11", 
    "com_t_11", 
    "eng_t_10", 
    "aca_t_11", 
    "saf_s_11", 
    "com_s_11", 
    "eng_s_11", 
    "aca_s_11", 
    "saf_tot_11", 
    "com_tot_11", 
    "eng_tot_11", 
    "aca_tot_11",
]
survey = survey.loc[:,survey_fields]
data["survey"] = survey

Add DBN columns

In [3]:
data["hs_directory"]["DBN"] = data["hs_directory"]["dbn"]

def pad_csd(num):
    # Zero-pad single-digit district numbers so every CSD is two characters
    # (equivalent to str(num).zfill(2)).
    string_representation = str(num)
    if len(string_representation) > 1:
        return string_representation
    else:
        return "0" + string_representation
    
data["class_size"]["padded_csd"] = data["class_size"]["CSD"].apply(pad_csd)
data["class_size"]["DBN"] = data["class_size"]["padded_csd"] + data["class_size"]["SCHOOL CODE"]

Convert columns to numeric

In [4]:
cols = ['SAT Math Avg. Score', 'SAT Critical Reading Avg. Score', 'SAT Writing Avg. Score']
for c in cols:
    data["sat_results"][c] = pd.to_numeric(data["sat_results"][c], errors="coerce")

data['sat_results']['sat_score'] = data['sat_results'][cols[0]] + data['sat_results'][cols[1]] + data['sat_results'][cols[2]]

def find_lat(loc):
    # Extract the latitude from the "(lat, lon)" string in the Location 1 column.
    coords = re.findall(r"\(.+, .+\)", loc)
    lat = coords[0].split(",")[0].replace("(", "")
    return lat

def find_lon(loc):
    # Extract the longitude from the same "(lat, lon)" string.
    coords = re.findall(r"\(.+, .+\)", loc)
    lon = coords[0].split(",")[1].replace(")", "").strip()
    return lon

data["hs_directory"]["lat"] = data["hs_directory"]["Location 1"].apply(find_lat)
data["hs_directory"]["lon"] = data["hs_directory"]["Location 1"].apply(find_lon)

data["hs_directory"]["lat"] = pd.to_numeric(data["hs_directory"]["lat"], errors="coerce")
data["hs_directory"]["lon"] = pd.to_numeric(data["hs_directory"]["lon"], errors="coerce")

Condense datasets

In [5]:
class_size = data["class_size"]
class_size = class_size[class_size["GRADE "] == "09-12"]
class_size = class_size[class_size["PROGRAM TYPE"] == "GEN ED"]

class_size = class_size.groupby("DBN").agg(np.mean)
class_size.reset_index(inplace=True)
data["class_size"] = class_size

data["demographics"] = data["demographics"][data["demographics"]["schoolyear"] == 20112012]

data["graduation"] = data["graduation"][data["graduation"]["Cohort"] == "2006"]
data["graduation"] = data["graduation"][data["graduation"]["Demographic"] == "Total Cohort"]

Convert AP scores to numeric

In [6]:
cols = ['AP Test Takers ', 'Total Exams Taken', 'Number of Exams with scores 3 4 or 5']

for col in cols:
    data["ap_2010"][col] = pd.to_numeric(data["ap_2010"][col], errors="coerce")

Combine the datasets

In [7]:
combined = data["sat_results"]

combined = combined.merge(data["ap_2010"], on="DBN", how="left")
combined = combined.merge(data["graduation"], on="DBN", how="left")

to_merge = ["class_size", "demographics", "survey", "hs_directory"]

for m in to_merge:
    combined = combined.merge(data[m], on="DBN", how="inner")

combined = combined.fillna(combined.mean())
combined = combined.fillna(0)

Add a school district column for mapping

In [8]:
def get_first_two_chars(dbn):
    return dbn[0:2]

combined["school_dist"] = combined["DBN"].apply(get_first_two_chars)

Find correlations

In [9]:
correlations = combined.corr()
correlations = correlations["sat_score"]
print(correlations)
SAT Critical Reading Avg. Score         0.986820
SAT Math Avg. Score                     0.972643
SAT Writing Avg. Score                  0.987771
sat_score                               1.000000
AP Test Takers                          0.523140
Total Exams Taken                       0.514333
Number of Exams with scores 3 4 or 5    0.463245
Total Cohort                            0.325144
CSD                                     0.042948
NUMBER OF STUDENTS / SEATS FILLED       0.394626
NUMBER OF SECTIONS                      0.362673
AVERAGE CLASS SIZE                      0.381014
SIZE OF SMALLEST CLASS                  0.249949
SIZE OF LARGEST CLASS                   0.314434
SCHOOLWIDE PUPIL-TEACHER RATIO               NaN
schoolyear                                   NaN
fl_percent                                   NaN
frl_percent                            -0.722225
total_enrollment                        0.367857
ell_num                                -0.153778
ell_percent                            -0.398750
sped_num                                0.034933
sped_percent                           -0.448170
asian_num                               0.475445
asian_per                               0.570730
black_num                               0.027979
black_per                              -0.284139
hispanic_num                            0.025744
hispanic_per                           -0.396985
white_num                               0.449559
                                          ...   
rr_p                                    0.047925
N_s                                     0.423463
N_t                                     0.291463
N_p                                     0.421530
saf_p_11                                0.122913
com_p_11                               -0.115073
eng_p_11                                0.020254
aca_p_11                                0.035155
saf_t_11                                0.313810
com_t_11                                0.082419
eng_t_10                                     NaN
aca_t_11                                0.132348
saf_s_11                                0.337639
com_s_11                                0.187370
eng_s_11                                0.213822
aca_s_11                                0.339435
saf_tot_11                              0.318753
com_tot_11                              0.077310
eng_tot_11                              0.100102
aca_tot_11                              0.190966
grade_span_max                               NaN
expgrade_span_max                            NaN
zip                                    -0.063977
total_students                          0.407827
number_programs                         0.117012
priority08                                   NaN
priority09                                   NaN
priority10                                   NaN
lat                                    -0.121029
lon                                    -0.132222
Name: sat_score, dtype: float64
In [10]:
%matplotlib inline
import matplotlib.pyplot as plt
import seaborn as sns
pal = sns.cubehelix_palette(25, start=2, rot=0, dark=0, light=.95, reverse=True)
by_corr = sns.barplot(x=survey_fields, y=correlations[survey_fields], palette=pal)
for item in by_corr.get_xticklabels():
    item.set_rotation(90)
sns.despine(left=True, bottom=True)
In [11]:
##combined.plot.scatter(x='saf_s_11', y='sat_score')
sns.set_style("dark")
sns.regplot(x='saf_s_11', y='sat_score', data=combined)
sns.despine(left=True, bottom=True)

Finding:

Student-reported safety (saf_s_11) has a weak positive correlation with SAT scores across most high schools (r ≈ 0.34).
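The figure quoted here is the Pearson correlation coefficient that pandas computes by default. As a minimal self-contained sketch on invented toy numbers (not the real school data):

```python
import pandas as pd

# Toy stand-ins for per-school safety scores and SAT scores
# (values invented purely for illustration).
toy = pd.DataFrame({
    "saf_s_11": [6.0, 6.5, 7.0, 7.5, 8.0, 8.5],
    "sat_score": [1100, 1150, 1300, 1250, 1450, 1500],
})

# Series.corr computes the Pearson correlation coefficient by default.
r = toy["saf_s_11"].corr(toy["sat_score"])
print(round(r, 3))  # close to 1 here, since the toy data is nearly linear
```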

In [12]:
import folium
from folium import plugins

schools_map = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()], tiles='Stamen Toner', zoom_start=10)
marker_cluster = folium.MarkerCluster().add_to(schools_map)
for name, row in combined.iterrows():
    folium.Marker([row["lat"], row["lon"]], popup="{0}:{1}".format(row["DBN"], row["school_name"])).add_to(marker_cluster)
schools_map.save('schools.html')
schools_map
Out[12]:
In [13]:
schools_heatmap = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()],tiles='Mapbox Bright', zoom_start=10)
schools_heatmap.add_children(plugins.HeatMap([[row["lat"], row["lon"]] for name, row in combined.iterrows()]))
schools_heatmap.save("heatmap.html")
schools_heatmap
Out[13]:
In [14]:
districts_d = combined.groupby("school_dist").agg(np.mean)
districts_d.reset_index(inplace=True)
districts_d["school_dist"] = districts_d["school_dist"].apply(lambda x: str(int(x)))

def show_district_map(col):
    geo_path = 'schools/districts.geojson'
    districts = folium.Map(location=[combined['lat'].mean(), combined['lon'].mean()], zoom_start=10)
    data = combined
    threshold = np.linspace(data[col].min(),
                              data[col].max(),
                              6, dtype=int)
    threshold = threshold.tolist()
    districts.choropleth(
        geo_path=geo_path,
        data=districts_d,
        columns=['school_dist', col],
        key_on='feature.properties.school_dist',
        fill_color='YlGn',
        fill_opacity=0.7,
        line_opacity=0.2,
        legend_name='Safety level of school',
        #threshold_scale=threshold,
    )
    ##districts.save("districts.html")
    return districts

show_district_map("saf_s_11")
Out[14]:

Lower Brooklyn and Astoria have the highest safety scores, while parts of the Bronx and the area where Brooklyn meets Queens are less safe for students.

In [15]:
show_district_map("sat_score")
Out[15]:

As we can see by comparing the two district-level maps, safer neighborhoods tend to have higher SAT scores.

In [16]:
import matplotlib.pyplot as plt
import seaborn as sns
races = ['white_per','asian_per','black_per','hispanic_per']
sns.set_style('whitegrid')
pal = sns.cubehelix_palette(4, start=.5, rot=-.75)
sns.barplot(x=races, y=correlations[races], palette=pal)
sns.despine(left=True)
In [17]:
sns.jointplot(x=combined['white_per'], y=combined['sat_score'])
sns.jointplot(x=combined['asian_per'], y=combined['sat_score'])
sns.jointplot(x=combined['black_per'], y=combined['sat_score'])
sns.jointplot(x=combined['hispanic_per'], y=combined['sat_score'])
Out[17]:
In [18]:
show_district_map("asian_per")
Out[18]:
In [19]:
show_district_map("hispanic_per")
Out[19]:
It looks like a higher percentage of white or Asian students at a school correlates strongly and positively with SAT score, whereas a higher percentage of black or Hispanic students correlates negatively with SAT score.
In [20]:
print(combined[combined['hispanic_per'] > 95]['SCHOOL NAME'])
44                         MANHATTAN BRIDGES HIGH SCHOOL
82      WASHINGTON HEIGHTS EXPEDITIONARY LEARNING SCHOOL
89     GREGORIO LUPERON HIGH SCHOOL FOR SCIENCE AND M...
125                  ACADEMY FOR LANGUAGE AND TECHNOLOGY
141                INTERNATIONAL SCHOOL FOR LIBERAL ARTS
176     PAN AMERICAN INTERNATIONAL HIGH SCHOOL AT MONROE
253                            MULTICULTURAL HIGH SCHOOL
286               PAN AMERICAN INTERNATIONAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object
These schools have a lot of students who are still learning English, which would explain the lower SAT scores. Their missions appear to be geared primarily toward recent immigrants to the US.
In [21]:
print(combined[(combined['hispanic_per'] < 10) & (combined['sat_score'] > 1800)]['SCHOOL NAME'])
37                                STUYVESANT HIGH SCHOOL
151                         BRONX HIGH SCHOOL OF SCIENCE
187                       BROOKLYN TECHNICAL HIGH SCHOOL
327    QUEENS HIGH SCHOOL FOR THE SCIENCES AT YORK CO...
356                  STATEN ISLAND TECHNICAL HIGH SCHOOL
Name: SCHOOL NAME, dtype: object

Many of the schools above appear to be specialized science and technology schools that only admit students who pass a very competitive entrance exam. That helps explain why their students do better on the SAT -- they are students from all over New York City who already did well on a standardized test.

In [22]:
correlations[['male_per', 'female_per']].plot.bar()
Out[22]:

In the plot above, we can see that a higher percentage of females at a school correlates positively with SAT score, while a higher percentage of males correlates negatively. Neither correlation is very strong.

In [23]:
combined.plot.scatter(x='male_per', y='sat_score')
combined.plot.scatter(x='female_per', y='sat_score')
Out[23]:

According to the scatterplots, there doesn't seem to be any strong correlation between sat_score and female_per. However, there is a cluster of schools with a high percentage of females (between 60% and 70%) that also have high SAT scores.

In [24]:
print(combined[(combined['female_per']> 60)&(combined['sat_score']>1700)]['SCHOOL NAME'])
5                         BARD HIGH SCHOOL EARLY COLLEGE
26                         ELEANOR ROOSEVELT HIGH SCHOOL
60                                    BEACON HIGH SCHOOL
61     FIORELLO H. LAGUARDIA HIGH SCHOOL OF MUSIC & A...
302                          TOWNSEND HARRIS HIGH SCHOOL
Name: SCHOOL NAME, dtype: object

These schools appear to be very selective liberal arts schools that maintain high academic standards.

In [25]:
combined['ap_per'] = combined['AP Test Takers ']/combined['total_enrollment']

combined.plot.scatter(x='ap_per', y='sat_score')
Out[25]:

There appears to be a positive relationship between the percentage of students in a school who take AP exams and the school's average SAT score, though the correlation is not very strong.
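The scatterplot above can be summarized with a single Pearson coefficient, derived the same way ap_per itself was. A self-contained sketch on invented toy values (not the real school data, which lives in combined):

```python
import pandas as pd

# Toy stand-in for the relevant columns of the combined DataFrame
# (values invented; note the trailing space in 'AP Test Takers ').
toy = pd.DataFrame({
    "AP Test Takers ": [120, 50, 300, 80, 10],
    "total_enrollment": [800, 600, 1200, 500, 400],
    "sat_score": [1400, 1150, 1600, 1250, 1050],
})

# Derive the share of students taking AP exams, as in the cell above,
# then quantify the relationship shown in the scatterplot.
toy["ap_per"] = toy["AP Test Takers "] / toy["total_enrollment"]
r = toy["ap_per"].corr(toy["sat_score"])
print(round(r, 3))
```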